Understanding the migrant death issue

The data on migrant deaths comes from the Arizona OpenGIS Project, cosponsored by Humane Borders and the Pima County Office of Medical Examiner. I’ve edited the data set to include all recorded migrant deaths between 2000 up through March 2025. The objective of this project is to give you continued experience in working with and making sense of quantitative data, while at the same time, giving you the opportunity to better understand at a deep level, the nature of the migrant death issue. Above all else, my goal is to always honor and humanize those who have died in the desert. These data permit a better understanding of the death crisis.

For purposes of this project, you will be asked to produce a number of high-quality visualizations as well as compute some basic univariate statistics. Your job is to take what I have provided, analyze it, and tell the story I want to know something important, interesting, and useful about the migrant death crisis given the tasks assigned to you. Ultimately, when presented with information, you need to get practice in learning how to engage it. While this assignment is tethered to the migrant death data set, all skills used here would be transferrable to any data set.

What I am looking for here is something other than mechanical interpretations. I want to see creativity. What do I mean by creativity? To be blunt, creativity is not repeating numbers or statistics you produce in a table. I don’t need anyone to do that as I can see it with my own eyes. Creativity paints a picture as to what the human picture of the death crisis looks like, looking at the observed data. Answers that are mechanical are usually nonanalytical. Creativity means bringing in information and context outside of the confines of the specific charts, plots, or questions. In the end, if you simply report the results with no attempt at interpretation, I would not expect anything above a “C”-level grade (i.e. you just reproduce in words what I see in the tables or charts). In class, we will go over in great detail tips on how to avoid these problems, so follow those tips and you’re going to be well on your way.

This assignment is worth 800 points and is due by 11:59 PM on April 29.

Getting started

The data file is a csv file saved to my GitHub site.

## Rows: 4033 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): ML Number, Name, Sex, Reporting Date, Surface Management, Location...
## dbl  (7): Age, Corridor Code, Condition Code, Latitude, Longitude, UTM X, UTM Y
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##   ML Number             Name               Sex                 Age       
##  Length:4033        Length:4033        Length:4033        Min.   : 0.00  
##  Class :character   Class :character   Class :character   1st Qu.:24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :30.00  
##                                                           Mean   :31.46  
##                                                           3rd Qu.:38.00  
##                                                           Max.   :99.00  
##                                                           NA's   :1398   
##  Reporting Date     Surface Management   Location         Location Precision
##  Length:4033        Length:4033        Length:4033        Length:4033       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Corridor Code     Corridor         Cause of Death     OME Determined COD
##  Min.   : 0.00   Length:4033        Length:4033        Length:4033       
##  1st Qu.: 6.00   Class :character   Class :character   Class :character  
##  Median : 7.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 7.03                                                           
##  3rd Qu.: 8.00                                                           
##  Max.   :13.00                                                           
##                                                                          
##  Condition Code Body Condition     Post Mortem Interval    State          
##  Min.   :1.00   Length:4033        Length:4033          Length:4033       
##  1st Qu.:1.00   Class :character   Class :character     Class :character  
##  Median :4.00   Mode  :character   Mode  :character     Mode  :character  
##  Mean   :3.84                                                             
##  3rd Qu.:6.00                                                             
##  Max.   :8.00                                                             
##                                                                           
##     County             Latitude       Longitude          UTM X       
##  Length:4033        Min.   :31.33   Min.   :-114.8   Min.   :143294  
##  Class :character   1st Qu.:31.75   1st Qu.:-112.3   1st Qu.:381244  
##  Mode  :character   Median :31.96   Median :-111.8   Median :424426  
##                     Mean   :32.00   Mean   :-111.8   Mean   :422503  
##                     3rd Qu.:32.21   3rd Qu.:-111.3   3rd Qu.:468782  
##                     Max.   :33.86   Max.   :-109.1   Max.   :685151  
##                                                                      
##      UTM Y        
##  Min.   :3466477  
##  1st Qu.:3513324  
##  Median :3536711  
##  Mean   :3540649  
##  3rd Qu.:3565118  
##  Max.   :3751034  
## 

Creating year variable from date function and inspecting migrant deaths by year

Because we may want to look at yearly data, it’s useful to generate a variable that records the calendar year in which migrant remains were found. The code in the chunk below does just this. In addition to creating the new variable, I use the R command tabyl to produce a table of migrant deaths by year. In all, there are about 4,000 recorded deaths in the time frame of the data set I’ve created. The table will show you the number of remains recovered by year along with the proportion of total deaths each year accounts for. So in 2010, we see 224 remains were recovered. The proportion of the total number of deaths accounted for by this year is 0.05476773 (i.e. \(\frac{224}{4090}\)). Multiply this proportion by 100 and you get the percent contribution. For 2010, about 5.5% of all the recovered remains occurred in 2010. One way to quickly assess the persistence of the death crisis is to inspect the proportions. If the crisis was abating, we’d expect to see a substantial decline in the proportion. If the crisis is persistent, we’d expect to see these proportions to be very similar across time. (Note that 2025 will be a very small number because we only have partial data for this year.) What do you see when you look at these proportions?

md$date <-  as.Date(md$`Reporting Date`,'%m/%d/%y')

md$year <- as.numeric(format(md$date,'%y'))

md$month <- as.numeric(format(md$date, '%m'))


#This ensures the year is reported as "2000, 2001, ... ,2024" instead of "0,1, ... , 24".
md$year20 <- 2000 + md$year



tabyl(md$year20)
##  md$year20   n     percent
##       2000  74 0.018348624
##       2001  77 0.019092487
##       2002 147 0.036449293
##       2003 155 0.038432928
##       2004 169 0.041904290
##       2005 196 0.048599058
##       2006 168 0.041656335
##       2007 214 0.053062237
##       2008 162 0.040168609
##       2009 190 0.047111332
##       2010 222 0.055045872
##       2011 178 0.044135879
##       2012 154 0.038184974
##       2013 165 0.040912472
##       2014 127 0.031490206
##       2015 135 0.033473841
##       2016 147 0.036449293
##       2017 118 0.029258616
##       2018 120 0.029754525
##       2019 136 0.033721795
##       2020 213 0.052814282
##       2021 215 0.053310191
##       2022 171 0.042400198
##       2023 197 0.048847012
##       2024 155 0.038432928
##       2025  28 0.006942723

Task 1: Visualizing migrant deaths

For this task, I want you to produce a barplot of migrant remains recovered by year. You can use the code below to generate the plot. Often (most always), it’s easier to visualize quantitative data than looking at a table of data. The code in the chunk below will create a barplot where each bar corresponds to the number of migrant remains recovered in each year. When you look at this plot, what do you see? What interpretation would you give to this? Does the plot show the crisis abating? Does it seem persistent? Is it getting worse? Your interpretation should be well-written, clear, and non-mechanical. The plot and the interpretation is worth 100 points.

labelsforx<- c("2000", "2001", "2002", "2003", "2004",              "2005", "2006", "2007", "2008", "2009", "2010",
                          "2011", "2012", "2013", "2014", "2015",
                          "2016", "2017", "2018", "2019", "2020",
                          "2021", "2022", "2023", "2024", "2025")

ggplot(md, aes(year20)) +
  geom_bar(fill = "lightskyblue4") +
     scale_x_continuous(breaks=seq(2000,2025,1), labels= labelsforx) +
    labs(title="Migrant remains recovered by year along the Arizona/Mexico border, 2000-2025", 
subtitle="Data from the Arizon OpenGIS Project (https://humaneborders.info/app/map.asp)",
          y="Number recovered", 
          x="Year") +
  theme_classic() +
    theme(axis.text.x = element_text(size=8, angle=45, hjust=1),
            axis.ticks = element_blank())

Task 1 answer should go here.

The migrant death crisis since the year 2000, appears not to be abating but persistent; however, it is not entirely clear whether the crisis is becoming worse or better. This is because throughout the last two decades the number of migrant deaths in terms of severity seemingly comes in waves, with periods of a high and low number of migrant remains recovered. For example, during the years 2002 to 2010, there was a stark increase in the number of remains recovered per year to approximately 175 to 200 remains recovered per year, during the years 2011 to 2019, there was a slight drop of average migrant remains recovered per year to approximately 100 to 125; however, from the year 2020 onward, the average number of migrant remains recovered saw an increase in its value to numbers similar to the 2002 to 2010 period. From the data, it appears that there is a difference between the years/time periods and the number of migrant remains recovered, there are many factors that could of caused this data either in actuality, such as different administrations and policies, or data collection, as there are many assumptions in the data, and many remains remain not found leading the data to only be a lower-bound estimate. Therefore, while there the number of migrant remains recovered differs by year, there are many factors that have contributed to this correlation to accurate state that there is a cause and effect between the year and remains recovered and there are no meaningful trends.

Task 2: Lethality by month

For this task, I want you to create a barplot of the number of migrant remains recovered by calendar month. Using the code in the previous chunk will be useful but it will need to be modified. After you create the plot, provide a substantive interpretation of the results. Your score for this will be based on quality of the plot (100 points) and quality of the write-up (100 points)

#Type code in this chunk
labelspermonth<- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "Demcember")

ggplot(md, aes(month)) +
  geom_bar(fill = "firebrick4") + 
  scale_x_continuous(breaks=seq(1,12,1), labels = labelspermonth) +
    labs(title="Migrant remains recovered by month along the Arizona/Mexico border, 2000-2025", 
subtitle="Data from the Arizon OpenGIS Project (https://humaneborders.info/app/map.asp)",
          y="Remains recovered", 
          x="Month") +
  theme_classic() +
    theme(axis.text.x = element_text(size=7, angle=45, hjust=1),
            axis.ticks = element_blank())

Task 2 answer should go here.

Statistically, more migrants attempt to cross the border during the Winter Months (January, February), logically more migrant deaths and remains on average should be found during said months. However, according to the data on the remains recovered on average per month, more remains were recovered during the Summer months (June, July). Months are directly correlated with a rise or fall in temperature, with the Summer Months having the highest temperatures, so in a desert, due to the extreme heat of the Summer months, logically, more migrants had died due to dehydration or heat exhaustion. Moreover, as due to to global warming, during the summer, where the desert becomes hottest, migrants face greater risk of death from dehydration or heatstroke. It could be argued that the data could be skewed due to remains found in one month could be scrutinizing the data since the remains could have been from a migrant that had died in the previous months; however, since this issue could occur in any month and has not reasonably skewed the data in a illogical fashion or had such a great effect on the data presented, the interpretation that heat and shifts in weather from the changing seasons contributed to the fact that more migrant remains were found during the summer months. Therefore, from the data, it is conclusive most migrants died during the Summer Months rather than the Winter Months due to the extreme heat in the desert.

Task 3: Gender and migrant deaths

How do migrant deaths and gender relate to one another? The code in the chunk below creates a “factor-level” variable recording the gender of the migrant. Since gender is not always determined, there is a category called “undetermined.” Using the tabyl function, I create a table showing the total number of remains recovered that are male, female, and undetermined. For this task, reproduce the code below and provide a brief interpretation of the results. This task is worth 50 points.

md$gender<-factor(md$Sex,
                  levels=c("male", "female", "undetermined"),
                  labels=c("male", "female", "undetermined"))
tabyl(md$gender)
##     md$gender    n   percent
##          male 3292 0.8162658
##        female  605 0.1500124
##  undetermined  136 0.0337218

Task 3 answer should go here.

According to the data on gender it appears there is a disproportionate number of the remains of male migrants (81%) found over the Arizona/Mexico border compared to the number of female migrant remains (15%) found. Logically, the ratio of male to female migrants should be around equal, meaning it is plausible to assume that there is an underlying reason for the gender difference in migrant remains found. A probable reason as to the gender difference could be due to Mexican culture and the role of males in the household. In a traditional Mexican family, due to Mexican society being patriarchal, the mother typically fulfills the role of a domestic caretaker while the father fulfills the role of a breadwinner and worker. Therefore it is more likely for the men of the family to attempt to cross the Arizona/Mexico border and subsequent deaths in search of financial security and opportunity in America compared to females as due to their respective traditional gender roles.

Task 4: Gender, death and time

On Github, there is a .csv file called mdg.csv. You will need to access this file. The code below will do this.

rm(list=ls(all=FALSE)) #Keep this line in 
rm(list=ls(all=TRUE)) #Keep this line in 

mdg="https://raw.githubusercontent.com/mightyjoemoon/POL51/main/mdg.csv"

mdg<-read_csv(url(mdg))
## New names:
## Rows: 25 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (4): year, male, female, undetermined lgl (1): ...5
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...5`
summary(mdg)
##       year           male            female        undetermined   
##  Min.   :2000   Min.   :0.6400   Min.   :0.0300   Min.   :0.0000  
##  1st Qu.:2006   1st Qu.:0.7700   1st Qu.:0.1000   1st Qu.:0.0000  
##  Median :2012   Median :0.8100   Median :0.1400   Median :0.0000  
##  Mean   :2012   Mean   :0.8244   Mean   :0.1496   Mean   :0.0272  
##  3rd Qu.:2018   3rd Qu.:0.8800   3rd Qu.:0.2000   3rd Qu.:0.0300  
##  Max.   :2024   Max.   :0.9700   Max.   :0.2400   Max.   :0.1900  
##    ...5        
##  Mode:logical  
##  NA's:25       
##                
##                
##                
## 

This data set has 4 variables: year, male, female, and undetermined. The variables “male,” “female”, and “undetermined” give the proportion of remains recovered for each group by year. So if in a given year, 0.76 of the remains recovered were male, this implies 76 percent were male. To tidy up the plot I am going to ask you to create three new objects called “Male,” “Female,” and “Undetermined” that convert the proportions into percentages.

It may be useful to visualize these data by way of a line plot. Below is the start of some code to create a time-series plot. Provide the relevant code to complete the plot. Your plot, if properly done will have three lines corresponding to the three groups (Male, Female, Undetermined) and have informative main, \(y\), and \(x\) titling. After creation of the plot, provide a substantive interpretation of the results. What do we see, what do we learn?

R tasks of reading in the data, creating the new objects, and creating the plot will be worth 100 points. The analysis will be worth 100 points.

#### Task 4 answer should go here. (https://news.gallup.com/poll/1660/immigration.aspx) (https://19thnews.org/2024/07/women-migrants-deaths-us-mexico-border/)

According to the graph, most remains are male, and since the year 2000, there has been a steady increase in the ratio of male to female remains found, coming to a climax in between the year 2015 to 2020 before seeing a decline with number or male remains found compared to females and undetermined in between years 2020 to 2025. While there are many factors that could explain this change, between the years 2000 to 2017, growing American sentiment against immigration, and subsequent stricter policies from American presidents, climaxing with the policies and focus of the Mexico/American border under Trump, towards immigrants likely caused the greater number of male remains recovered. This is because with stricter policies and enforcement, means greater risk posed towards illegal migrants attempting to cross the Mexico/American border; therefore, in an effort to combat risk and curb loss, males from Mexico, rather than whole families, may have illegally attempted to cross the border alone, leading to a greater number of male remains compared to females remains to be found. However, since 2017, the number of female remains found has seen an uptick. While American sentiment remains roughly the same against migrants, shifts in Mexican culture towards greater equality and an increasingly unstable political and social climate in Mexico has likely spurred more women to be willing to cross the border for the same reasons as men. Especially since mothers in Mexico fear the risk of gang violence and recruitment for their children, compelling them to want to face the risk of crossing the border in the hope of a better future for their children.

Task 5: Visualizing deaths by month and year

I have created a dataset called “monthlydeathsbyyear.csv.” And it can be found on my Github site. The code in the chunk below will access the file.

rm(list=ls(all=TRUE)) #Keep this line in 

mdy="https://raw.githubusercontent.com/mightyjoemoon/POL51/main/monthlydeathsbyyear.csv"

mdy<-read_csv(url(mdy))
## Rows: 299 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): year, month, obs, yearmonth, remainsmonth
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(mdy)
##       year          month             obs      yearmonth       remainsmonth  
##  Min.   :2000   Min.   : 1.000   Min.   :1   Min.   :200001   Min.   : 1.00  
##  1st Qu.:2006   1st Qu.: 4.000   1st Qu.:1   1st Qu.:200605   1st Qu.: 7.00  
##  Median :2012   Median : 7.000   Median :1   Median :201207   Median :12.00  
##  Mean   :2012   Mean   : 6.512   Mean   :1   Mean   :201210   Mean   :13.39  
##  3rd Qu.:2018   3rd Qu.: 9.500   3rd Qu.:1   3rd Qu.:201810   3rd Qu.:16.00  
##  Max.   :2024   Max.   :12.000   Max.   :1   Max.   :202412   Max.   :69.00

In order to take advantage of some R capabilities with time series data, I am creating a date variable that will allow us to assess migrant remains recovered by month and year. The code in the chunk below will do this. I will discuss the logic of this in class.

mdy$ym<-as.Date(with(mdy,paste(year,month,obs, sep="-")),"%Y-%m-%d")
mdy
## # A tibble: 299 × 6
##     year month   obs yearmonth remainsmonth ym        
##    <dbl> <dbl> <dbl>     <dbl>        <dbl> <date>    
##  1  2000     1     1    200001            4 2000-01-01
##  2  2000     2     1    200002            7 2000-02-01
##  3  2000     3     1    200003            6 2000-03-01
##  4  2000     4     1    200004            5 2000-04-01
##  5  2000     5     1    200005           11 2000-05-01
##  6  2000     6     1    200006           15 2000-06-01
##  7  2000     7     1    200007            5 2000-07-01
##  8  2000     8     1    200008            7 2000-08-01
##  9  2000     9     1    200009            6 2000-09-01
## 10  2000    10     1    200010            4 2000-10-01
## # ℹ 289 more rows
#View(mdy)

The code in the chunk below will produce a time series plot of the migrant death data by month and year. The plot is saved as an object I called “p.” Run this code to verify it works.

# Usual area chart
p <- mdy %>%
  ggplot( aes(x=ym, y=remainsmonth)) +
  geom_area(fill="#69b3a2", alpha=0.5) +
  geom_line(color="#69b3a2") +
  labs(title="Migrant remains recovered on the Arizona/Mexico border, 2000-2025", 
subtitle="Data from the Arizon OpenGIS Project (https://humaneborders.info/app/map.asp)",
  y="Remains recovered", x="Year and month") +
  theme_bw()
p

Next, run this code and see what it is you produce! For this part of the task, write an interpretation of this plot. What do we learn substantively about the migrant death crisis over time? This write-up is worth 100 points.

# Turn it interactive with ggplotly
p <- ggplotly(p)
p

Task 5 write-up should go here.

Each year, the greatest maximum of remains found is consistently the first of July, pointing to the idea that during the summer, migrants face the most risk of death due to dehydration; however, from the data of migrant remains recovered over the years, it appears that overall, the migrant death crisis has waned slightly in terms of severity since the beginning of 2000. This is because in the early 2000s until 2010, the maximums of each year, especially in 2005, are far greater than years past 2010.

Task 6: See if you can do this!

What is this plot showing us? What is going on in the plot? This task is worth 50 points.

pal <- c("green", "blue", "red")

fig <-  plot_ly(data = mdy ,x =  ~ym, y = ~remainsmonth, color = ~remainsmonth, colors = pal, type = 'scatter', mode = 'markers')%>%
  layout(title = 'Scatterplot of migrant remains recovered by month and year', plot_bgcolor = "white",
               xaxis=list(range = list("2000-01-01", "2025-3-01")))

fig

Task 7: Univariate statistics

For this task, I want you to provide a substantive answer to the following questions (7.1 and 7.2 are worth 50 points each; 7.3 is worth 100 points):

Task 7.1: Statistics for all months/years

7.1. What is the mean, standard deviation, median, and IQR for the variable called “remainsmonth.” This variable records the number of deaths by month by year. What do we learn from these statistics? Is there evidence of skewness in the data?

#Insert code to answer this here
mean(mdy$remainsmonth) # = 13.39465
## [1] 13.39465
median(mdy$remainsmonth) # = 12
## [1] 12
sd(mdy$remainsmonth) # = 9.028187
## [1] 9.028187
IQR(mdy$remainsmonth) # = 9
## [1] 9
psych::describe(mdy$remainsmonth)
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 299 13.39 9.03     12    12.2 7.41   1  69    68 1.97     6.62 0.52

Task 7.1 answer should go here.

Task 7.2: Univariate statistics over time

7.2. Compute the mean number of remains recovered by year. Based on these results, what do we learn? What years stand out to you?

#Insert code to answer this here
aggregate(mdy$remainsmonth, list(mdy$year), FUN = mean)
##    Group.1         x
## 1     2000  6.166667
## 2     2001  7.000000
## 3     2002 12.250000
## 4     2003 12.916667
## 5     2004 14.083333
## 6     2005 16.333333
## 7     2006 14.000000
## 8     2007 17.833333
## 9     2008 13.500000
## 10    2009 15.833333
## 11    2010 18.500000
## 12    2011 14.833333
## 13    2012 12.833333
## 14    2013 13.750000
## 15    2014 10.583333
## 16    2015 11.250000
## 17    2016 12.250000
## 18    2017  9.833333
## 19    2018 10.000000
## 20    2019 11.333333
## 21    2020 17.750000
## 22    2021 17.916667
## 23    2022 14.250000
## 24    2023 16.416667
## 25    2024 12.916667

Task 7.2 answer should go here.

Task 7.3: Time-series box-plot

7.3. Create a boxplot of monthly remains recovered by year. Below is shell code to get you started. This will produce an unacceptable plot but it will get you on your way. Adorn this plot with titles, color (if you wish), and any other fancy things to make it more interesting. Following this, provide an interpretation of the plot. What do we learn from this plot about migrant remains recovered?

#Shell code

mdy$year<-as.factor(mdy$year)

bp<-ggplot(mdy, aes(remainsmonth, year)) + 
  geom_boxplot(outlier.color = "dodgerblue3", fill = "lightskyblue3") +
  labs(title="Monthly Remains Recovered by Year on the Arizona/Mexico border, 2000-2025",
     y="Year", x="Remains Recovered per Month") +
  theme_bw()
bp

Task 7.3 answer should go here.